HEARTS: A DIALECT OF THE POKER PROGRAMMING ENVIRONMENT
SPECIALIZED  TO  SYSTOLIC  COMPUTATION 


LAWRENCE  SNYDER 


INTRODUCTION 


Because systolic algorithms are commonly thought of as being directly implemented as
hardware  arrays,  writing  systolic  programs  would  appear  to  be  an  activity  without  application, 
and  therefore  without  need  for  a  programming  environment.  But  the  appearance  is  deceiving. 
There  are  many  times  when  one  indeed  does  program  systolic  algorithms:  when  the  systolic 
array  is  programmable,  during  the  design  process  (for  simulation  purposes)  of  hardwired  array 
implementations,  when  a  systolic  algorithm  is  used  on  a  general  purpose  parallel  computer, 
or  when  one  is  engaged  in  research  on  systolic  algorithms.  Furthermore,  systolic  arrays  share 
with  other  parallel  algorithms  the  characteristic  of  being  deceptively  complex  -  perhaps  their 
simplicity  makes  the  deception  even  greater  -  so  the  benefits  of  a  programming  environment 
to ease the programming task become very important. In this paper we describe the design
of a parallel programming environment specialized to systolic computation and illustrate its
use. 

The “root language” from which the current proposal derives is the Poker parallel
programming environment [Snyder, 1984]. Although Poker was originally developed for the CHiP
family of computers [Snyder, 1981], including the Pringle [Kapauan et al., 1984], the language
has exhibited much wider applicability. Specifically, Poker is being retargeted [Snyder and
Socha, 1986] for the Caltech Cosmic Cube [Seitz, 1985] and it has recently been argued
[Snyder, 1986] to be ideal as the basis for a systolic programming environment. The design of
this new environment, called Hearts, is the topic of the present paper.

One of the key ideas of Poker, inherited by Hearts, is the novel notion of “program”.
Specifically, the programmer’s view of the program appears to be a dynamic version of a
textbook illustration. Such metaphorically rich pictures correlate closely with the programmer’s
thinking  and  thus  simplify  the  programming  task.  (The  next  section  is  an  example  illustrating 
this  similarity  between  the  textbook  and  Hearts  descriptions.) 

Although Poker was just described as a “language”, we also use the name to refer to
the whole programming environment. In this sense Poker, and by extension Hearts, is an
integrated set of facilities providing full support for editing, compiling, assembling, loading,
tracing, and debugging, as well as an interface to the underlying UNIX operating system,
certain correctness checks, the external file system, and library and “help” facilities. The system
is so complete that the user need not exit until the session is over; moreover, the system
is sufficiently well integrated that the user moves effortlessly between the various
“subsystems”, almost unaware that the change of activity has required different support from the
environment.
illustrated in the present example, for synchronization reasons. In any case, more than one
process may be defined, and processes may have parameters in order to particularize them
more  easily. 

The sequential language used for process definition requires some extensions and some
specialized semantics to simplify programming systolic computations. In the declarations, trace
variables provide a means of dynamically displaying the computation, and ports declares the
names to be used for the processor's datapaths. (Both features are further explained below.)
The arrow operator,

    <variable> <- <portname>

    <portname> <- <variable>

specifies input or output depending on whether the port name is on the right or left,
respectively. The semantics are that a scalar value is transmitted, reading from a port name that
does not correspond to a datapath (see Figure 5) yields a 0, and writing to a port name
that does not correspond to a datapath is a noop. The command tock is a synchronization
specifier: control waits at the tock and proceeds only when the global clocking signal is
received.¹
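
The port semantics just described can be modelled in a few lines of Python. This is only an illustrative sketch and not Hearts code: the class name Processor, its methods read and write, and the use of queues to stand in for datapaths are all assumptions introduced here, and the tock synchronization is left out.

```python
from collections import deque


class Processor:
    """Toy model of one processor's ports (hypothetical names, not Hearts syntax)."""

    def __init__(self, datapaths):
        # Maps a port name to a queue standing in for a physical datapath.
        # A port name that is absent from the mapping labels no datapath.
        self.datapaths = datapaths

    def read(self, port):
        """Model of `variable <- portname`."""
        path = self.datapaths.get(port)
        if path is None:
            return 0  # reading a port with no datapath yields 0
        return path.popleft()  # one scalar value is transmitted

    def write(self, port, value):
        """Model of `portname <- variable`."""
        path = self.datapaths.get(port)
        if path is not None:
            path.append(value)
        # writing to a port with no datapath is a no-op


cout = deque()
p = Processor({"Ain": deque([2.0]), "Cout": cout})
a = p.read("Ain")       # Ain labels a datapath
c = p.read("Cin")       # Cin labels no datapath in this mapping
p.write("Cout", a + c)  # value is placed on the Cout datapath
p.write("Bout", 99.0)   # Bout labels no datapath, so nothing happens
```

In this model an edge processor needs no special-case code: the same read and write calls work whether or not a datapath is present, which is the point of the default-0 and noop rules.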


Figure  2 

The process code appears to be somewhat more complex than might seem to be needed
based on the original Kung and Leiserson description, but what complexity there is can be
attributed to declarations and to providing three entry points to the compute-idle-idle cycle.
Specifically, the process is parameterized with an integer, cycle, specifying whether the process
should begin by executing its inner product step (cycle=0), should execute the inner product
after one idle step (cycle=1) or after two (cycle=2). A second parameter, lastval of type
port, states which input stream's termination will signal termination of the process. This
specification, permitting values to “drain” out of the array, is required by the fact that,
although the end of the C array stream generally signals the end of the processing, C is
actually created internally (the processors on the east and south sides of the array simply
read the Cin port which returns the default value 0 because (see Figure 6) they do not label
an actual data path) and thus will never terminate; the end of the A and B array streams
will terminate processing for processors on the south and east edges of the array, respectively.
The choice of terminator for the corner element is arbitrary. The streams are terminated by
a special token, EOS, mnemonic for end of stream. The remainder of the code should be
self-explanatory. (The branches into the body of the repeat-until loop violate the tenets of
software engineering and are used here only to reduce confusion in this presentation; several
more acceptable but more verbose solutions are available.)


¹ “Tick” is also used in the system to measure time, but a full explanation goes beyond the
scope of this paper.

code inner (cycle, lastval);
trace a, b, c;
ports Ain, Bin, Cin, Aout, Bout, Cout;
begin real a, b, c; int cycle; port lastval;

      if cycle = 0 then go to L0              /* Define cycle 0 entry pt */
      if cycle = 1 then go to L1              /* Define cycle 1 entry pt */
      a := b := c := 0;                       /* Cycle 2 entry point */
      repeat
          c := c + a * b;
          Aout <- a, Bout <- b, Cout <- c;
  L2:     tock;
  L1:     tock;
  L0:     tock;
          a <- Ain, b <- Bin, c <- Cin;
      until EOS(lastval);
      Aout <- a, Bout <- b, Cout <- c;
end.

Figure 3
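
For readers who wish to experiment with the control flow of the inner process, the following Python generator is a rough model of it. It is a sketch under stated assumptions and not a Hearts implementation: each yield stands in for a tock, the string "EOS" stands in for the end-of-stream token, EOS(lastval) is interpreted as "the value most recently read from that port was the token", and the read and write callables are hypothetical helpers supplied by the caller.

```python
from collections import deque

EOS = "EOS"  # hypothetical stand-in for the end-of-stream token


def inner(cycle, lastval, read, write):
    """Generator model of the process; every `yield` plays the role of `tock`."""
    a = b = c = 0
    compute, tocks = True, 3           # cycle 2: fall through into the loop
    if cycle == 0:
        compute, tocks = False, 1      # models `go to L0`
    elif cycle == 1:
        compute, tocks = False, 2      # models `go to L1`
    while True:
        if compute:
            c = c + a * b              # inner product step
            write("Aout", a); write("Bout", b); write("Cout", c)
        for _ in range(tocks):
            yield                      # wait for the global clock
        compute, tocks = True, 3       # later passes run the full loop body
        a, b, c = read("Ain"), read("Bin"), read("Cin")
        if {"Ain": a, "Bin": b, "Cin": c}[lastval] == EOS:
            break                      # models `until EOS(lastval)`
    write("Aout", a); write("Bout", b); write("Cout", c)  # drain last values


# A single processor whose Cin labels no datapath (so it reads as 0).
inputs = {"Ain": deque([1, 2, EOS]), "Bin": deque([3, 4, EOS])}
outputs = {"Aout": [], "Bout": [], "Cout": []}


def read(port):
    queue = inputs.get(port)
    return queue.popleft() if queue is not None else 0


def write(port, value):
    outputs[port].append(value)


tocks_seen = sum(1 for _ in inner(0, "Ain", read, write))
```

With these inputs the model forwards the A and B values unchanged, emits the products 3 and 8 on Cout followed by the final drained 0, and passes through seven clock periods (one for the initial entry at L0, then three for each of the two full loop passes).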


Process Assignment. The process assignment activity associates processes and their
actual parameters with processors. The specification is given using a modified version of the
communication structure graph in which a window is provided for each vertex. The name of
the process to be executed on the processor is entered in the window together with its actual
parameters, if any.

The matrix multiplication problem uses only one process having two parameters, cycle
and lastval, although it might have been equally convenient to use three processes - one each
for the different positions of the inner product step in the compute-idle-idle cycle - in which
case the variable cycle would not be required [Snyder, 1986]. Notice that the actual values for
cycle line up (on counter diagonals); the processors in the main block of the array all get their
last value from Cin; the choice of Ain or Bin as the actual value for lastval was arbitrary for
the corner processor.
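
As a rough illustration of what such an assignment contains, the following Python sketch builds one entry per processor for a rectangular block. Several things here are assumptions and are not taken from the figure: the function name assign, the row and column numbering from the north-west corner, and above all the rule cycle = (i + j) mod 3, which is used only because it is constant along counter diagonals. The choice of Ain for the corner processor is the arbitrary one mentioned above.

```python
def assign(rows, cols):
    """Return {(i, j): (process name, cycle, lastval)} for a rows x cols block.

    Assumed rule: cycle = (i + j) mod 3, so equal values line up on
    counter diagonals. The values in the actual figure may differ.
    """
    table = {}
    for i in range(rows):
        for j in range(cols):
            on_south_edge = i == rows - 1
            on_east_edge = j == cols - 1
            if on_south_edge:
                lastval = "Ain"   # end of the A stream terminates south-edge processors
            elif on_east_edge:
                lastval = "Bin"   # end of the B stream terminates east-edge processors
            else:
                lastval = "Cin"   # main block terminates on the C stream
            table[(i, j)] = ("inner", (i + j) % 3, lastval)
    return table


entries = assign(3, 3)
```

Because the south-edge test comes first, the corner processor at (2, 2) receives Ain; swapping the two tests would give it Bin instead.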

Port Name Assignment. Port naming simply labels the datapaths incident to a processor
with identifiers used in the process definition. As with the process assignment, a modified
version of the communication structure graph is used, but for port naming the window associated
with each vertex is divided into eight “panes” corresponding to the eight compass points. The
port names, although clipped to five characters each in the display, are of arbitrary length.

The specification of the port names for the matrix product program is straightforward;
for example, Bin is in the northernmost pane and Bout in the southernmost because the B
matrix flows from top to bottom. Notice that if a port name labels a direction and there
is no datapath incident to the processor at that compass point then reading that port yields
a 0 and writing to it is a noop.

Stream Name Assignment. A stream is the sequence of values entering or leaving a systolic
array from its perimeter. The purpose of stream name assignment is to specify the direction of
data flow and to organize the streams together into logical units by associating like names and
indices. These logical units can then be bound to file names, thereby providing the interface
between the Hearts system and the underlying file system.

The stream names are given in a table, one entry per “dangling” edge, or pad. For each pad
a  name,  a  (unique  for  that  name)  index  and  direction  of  flow  are  specified.  To  the  right  of  the 
vertical  line,  the  table  contains  copious  information  derived  from  the  other  four  constituent 
parts  of  the  program;  this  information  is  provided  by  the  system  for  its  mnemonic  value. 

The  diagonals  of  the  A  and  B  arrays  are  stream  inputs  to  the  program.  The  fact  that 
each  array  uses  one  stream  name  with  indices  implies  that  each  array  will  be  stored  in  its 
own  file,  since  file  names  can  be  bound  to  stream  names.  The  indices  are  used  to  specify 
the  position  of  the  stream  within  the  file:  A  file  with  k  streams  each  of  n  values  will  contain 
n  fixed  length  records  each  with  k  fields;  the  index  specifies  the  field  position.  Thus,  the 
assignment  of  indices  for  the  A  and  B  array  is  dictated  by  how  the  two  band  matrices  are  to 
be  stored  in  the  file.  Similarly,  the  specification  of  indices  for  the  C  array  dictates  how  the 
streams  are  to  be  composed  into  a  file. 
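
The record layout described above can be made concrete with a short Python sketch. It is illustrative only: the function names are hypothetical, records are modelled as tuples instead of fixed-length byte records, and indices are assumed to run from 1 to k.

```python
def streams_to_records(streams):
    """Interleave k equal-length streams into n records of k fields each.

    `streams` maps a stream index (assumed to run 1..k) to its list of n values.
    """
    k = len(streams)
    if sorted(streams) != list(range(1, k + 1)):
        raise ValueError("stream indices must be exactly 1..k")
    lengths = {len(values) for values in streams.values()}
    if len(lengths) != 1:
        raise ValueError("all streams must hold the same number of values")
    n = lengths.pop()
    # Record t holds the t-th value of every stream; the index picks the field.
    return [tuple(streams[i][t] for i in range(1, k + 1)) for t in range(n)]


def field_of(records, index):
    """Recover the stream with the given index from the records."""
    return [record[index - 1] for record in records]


A = {1: [10, 11, 12], 2: [20, 21, 22]}  # k = 2 streams of n = 3 values
records = streams_to_records(A)
```

Here the file for A would hold three records of two fields, and reading field 2 of every record recovers the second stream.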

Notice  that  the  systolic  program  is  defined  in  Hearts  by  three  pictures  (the  communication 
structure  definition,  process  assignment,  and  port  name  assignment),  a  segment  of  sequential 
program text (process definition), and a table (stream name definition); these are not
illustrations of the program, they are the program. Hence, the Hearts program exhibits the pictorial
qualities of the textbook form, though it accomplishes them in a somewhat different way.

With  the  form  of  an  example  program  fully  established,  it  is  now  possible  to  describe  the 
Hearts  programming  environment. 

THE  HEARTS  ENVIRONMENT 


Hearts  is  an  integrated  parallel  programming  environment  using  interactive  graphics  to 
give  the  programmer  all  of  the  facilities  needed  to  write,  debug  and  execute  systolic  programs. 
In  this  section  we  describe  the  main  features  of  the  environment. 

Hearts, like Poker, does not have a textual form for its programs,² but rather represents
them  as  a  relational  database  and  presents  the  information  to  the  programmer  in  a  form 
called  a  view.  The  benefit  of  this  nontextual  form  is  simple:  The  programmer  need  not  go  to 
the  trouble  of  encoding  the  program  and  the  system  need  not  go  to  the  trouble  of  decoding 
(parsing)  it.  The  view,  prepared  by  the  system  at  the  programmer's  direction,  provides 
an interface between human and machine across which the information about the runtime
behavior of a systolic array is established without a textual encoding of that information.

Two displays are used to maximize the information available to the programmer: a
primary, bitmapped graphics display shows most of the graphical views; the secondary display
is used only for process definitions, which the programmer creates and modifies using a
standard editor. Figure 7 shows a typical view on the primary display. In addition to the four
constituent parts of a program (all but the process definition) that are shown as graphical
views, there are other views to support other facilities of the environment. The views are:

Interconnection View. Displays the communication structure of the systolic array;
the  programmer  interconnects  the  processors  by  drawing  lines  with  cursor  keys  or 
a  mouse;  see  Figure  2.  [Corresponds  to  Switch  Setting  View  in  Poker]. 

Code Names View. Displays the process assignment information using an abstraction
derived from the communication structure where there is a window for each
vertex;  the  programmer  moves  from  window  to  window  entering  the  process  name 
and  any  actual  parameters  to  the  process;  see  Figure  4. 

Port Names View. Displays the port name assignment information using an abstraction
based on the communication structure similar to the Code Names View;
the programmer uses the cursor keys or mouse to move to the various windows and
within a window to move to the different panes where the port name is entered;
see  Figure  5. 

I/O Names View. Displays the stream name assignment information using a table,
the right half of which is prepared by the system; the programmer moves from line
to line, entering the names and indices of the streams and whether they are input
or  output  streams;  see  Figure  6. 

Command Request View. Provides the programmer with the compilation, assembly,
linking, and loading functions so the program can be prepared for execution;
to emphasize the nonstandard nature of Hearts, one of the available facilities
“compiles” the communication graph. Program execution can be initiated here in
“production” mode; execution for debugging programs is initiated in Trace View.


² The process definition is textual, of course, but this is only one of the five program constituents.

Trace View. Displays the execution of the program while tracing the values of
the variables mentioned in the preamble to the process definition; like Code and
Port Names Views, Trace uses the abstracted version of the communication graph;
see Figure 7 where the a, b, and c values are being traced and execution has been
captured at the moment when processor 2,2 computes its first intermediate result.

System Parameters View. Displays the parameters, e.g. the number of processors,
of the array being programmed, and provides the programmer with the ability to
change them. [Corresponds to CHiP Parameters in Poker].


Each  view  provides  many  other  facilities  in  addition  to  those  just  mentioned  including  screen 
management  facilities,  help  facilities,  access  to  the  underlying  operating  system,  diagnostic 
facilities,  and  so  forth. 

A  new  Hearts  programming  session  would  begin  by  defining  System  Parameters  to  specify 
the size and type of array to be programmed, as well as other system characteristics. Next
the  programmer  will  enter  one  of  the  views  and  begin  specifying  the  information  required  for 
that  view.  Although  there  is  no  mandatory  order  in  which  the  user  visits  the  views,  there  are 
some  weak  dependencies:  a  program  must  be  defined  before  tracing  is  possible,  and  pads  must 
be  defined  in  the  communication  structure  before  stream  names  can  be  defined.  Generally, 
the  programmer  begins  with  the  Interconnection  View  to  define  the  communication  structure, 
but  then  moves  between  views  frequently  in  what  may  appear  to  the  observer  to  be  a  rather 
random  way. 

A  common  property  of  systolic  arrays  is  that  much  of  the  information  is  repetitious,  and 
so  it  would  appear  to  be  rather  tedious  to  have  to  specify  all  of  this  repeated  information.  In 
fact,  there  are  a  variety  of  facilities  to  enable  the  programmer  to  enter  repeated  information 
easily.  In  the  Code  Names  and  Port  Names  views,  for  example,  the  same  entry  can  be  assigned 
to all elements of multiple rows or columns with half a dozen key strokes. Similarly, in the I/O
Names  View  it  is  possible  to  assign  stream  names  with  consecutive  indices  to  multiple  pads 
occurring  in  a  pattern;  thus  for  the  Stream  Names  Definition  in  Figure  6  the  programmer  had 
to  make  only  three  complete  entries,  one  for  each  array,  and  three  repetition  commands. 

When the program is complete the programmer can either check it by running predicate
checks, which verify that the program has certain properties, e.g. all processors are connected
in the communication structure, or move directly to the Command Request View to
compile it. After the program has been successfully compiled and linked it can be
downloaded either into the physical hardware or into a simulator. In the latter case
the  programmer  can  use  the  Trace  View  to  watch  a  continuously  updated  display  of  the 
progress  of  the  simulated  computation,  and  thereby  observe  bugs  in  the  program.  Should 
bugs  be  found  they  can  be  corrected  by  returning  to  the  appropriate  “source  view.” 
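
One such predicate check can be sketched as follows. This is a hypothetical illustration, not the Hearts check itself: we assume that "all processors are connected" means the communication structure forms a single connected graph, and the function name and the representation of links as pairs of processor identifiers are introduced here only for the example.

```python
from collections import deque


def all_connected(processors, links):
    """Return True if every processor is reachable from every other one.

    `processors` is an iterable of identifiers; `links` is an iterable of
    (u, v) pairs naming the two processors joined by a datapath. Both
    endpoints of every link are assumed to appear in `processors`.
    """
    processors = set(processors)
    if not processors:
        return True
    neighbours = {p: set() for p in processors}
    for u, v in links:
        neighbours[u].add(v)
        neighbours[v].add(u)
    # Breadth-first search from an arbitrary starting processor.
    start = next(iter(processors))
    seen = {start}
    queue = deque([start])
    while queue:
        for nxt in neighbours[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen == processors


procs = [(0, 0), (0, 1), (1, 0), (1, 1)]
good = all_connected(procs, [((0, 0), (0, 1)), ((0, 1), (1, 1)), ((1, 1), (1, 0))])
bad = all_connected(procs, [((0, 0), (0, 1)), ((1, 0), (1, 1))])
```

A failed check of this kind would send the programmer back to the Interconnection View to add the missing datapath.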


CONCLUSIONS 


There  are  two  key  points  about  the  Hearts  parallel  programming  environment:  the  novel 
program  structure  and  the  new  nontextual  style  of  specifying  it.  These  two  features  combine 
to  yield  a  convenient,  efficacious  programming  environment  for  writing  and  running  systolic 
programs. In particular, the metaphorically rich pictures that the system provides the
programmer offer a perspicuity comparable to dynamic versions of textbook illustrations.

The Hearts system described here is only a design, which has not yet been fully
implemented. The Poker environment, of which Hearts is a dialect, has been fully implemented
and has been used to write systolic programs. Not only has Poker provided the concepts and
experience needed to design Hearts (and the figures shown in this paper, too), it represents
about 90% of the facilities of the Hearts implementation. The major differences involve the
Interconnection  View  where  the  communication  structure  in  Hearts  is  somewhat  more  easily 
specified,  the  process  definition  language  where  more  constructs  specialized  to  systolic  arrays 
are provided, and in the backend simulator where a synchronous execution mode must be
provided; minor differences abound and reflect improvements derived from our experience with
this  style  of  programming. 